Honey bee colony loss linked to parasites, pesticides and extreme weather across the United States

Honey bee (Apis mellifera) colony loss is a widespread phenomenon with important economic and biological implications, whose drivers are still an open matter of investigation. We contribute to this line of research through a large-scale, multi-variable study combining multiple publicly accessible data sources. Specifically, we analyzed quarterly data covering the contiguous United States for the years 2015-2021, and combined open data on honey bee colony status and stressors, weather data, and land use. The different spatio-temporal resolutions of these data are addressed through an up-scaling approach that generates additional statistical features which capture more complex distributional characteristics and significantly improve modeling performance. Treating this expanded feature set with state-of-the-art feature selection methods, we obtained findings that, nation-wide, are in line with the current knowledge on the aggravating roles of Varroa destructor and pesticides in colony loss. Moreover, we found that extreme temperature and precipitation events, even when controlling for other factors, significantly impact colony loss. Overall, our results reveal the complexity of biotic and abiotic factors affecting managed honey bee colonies across the United States.


Data treatment
For our linear modeling exercise, after the data processing phase (see the Data processing Section of the main text), we further manipulated the data as follows. Honey bee stressor variables such as "Varroa destructor ", "other pests and parasites", "diseases", "pesticides", and "other" were modeled taking the logit transformation of the proportion of colonies affected by these stressors (similarly to what we did for the response variable; see Statistical model Section of the main text). Regarding honey bee status and stressor data, uninformative predictors for honey bee colony loss were excluded from the analysis, namely: "number of initial colonies", "percentage of lost colonies", "percentage of renovated colonies", "unknown", "number of added colonies", and "number of renovated colonies" (see Table S1). For the weather indexes that we built (see Table S2), "alpha indexes", and "kurtosis" of weather variables and the "green-area index" (see Table S3) were log-transformed to mitigate the skewness of their distributions (see Fig. S7). The "precipitation alpha index" was not considered in our analysis due to its extremely concentrated distribution, and we divided the "norms" of weather indexes by a factor of 10 4 to limit their scale. Observations with any missing entry were excluded from the analysis, bringing the sample size from n = 880 to n = 674. Furthermore, due to strong correlations in the collection of weather indexes that we built in our up-scaling procedure (see Fig. S8), some features were removed at the outset. Specifically, if the absolute pairwise Pearson correlation between two predictors exceeded the cut-off of 0.9, then the variable with higher mean absolute correlation (with respect to all the remaining features) was excluded from the model. This was performed through the findCorrelation() function of the caret R package (Kuhn, 2009). The predictors excluded in this way include: "minimum temperature mean", "maximum temperature entropy", "maximum temperature alpha index", "precipitation mean", and "precipitation kurtosis". See Table S4 for a description of the predictors used in the linear modeling exercise described in the main text.

Robust feature selection parameter tuning
We used mixed-integer programming (MIP) techniques for simultaneous feature selection and outlier detection developed by our group (Insolia et al., 2021) to obtain the regression results presented in Table 1 of the main text (see also the Statistical model Section of the main text). Operationally, one has to estimate a suitable level of trimming k n , which represents the number of points not affecting the fit, as well as a sparsity parameter k p , controlling the number of estimated regression coefficients which are non-zero. We explored several combinations of k n and k p values; in the following we present the results for k n /n ranging from 0 to 15% (with a step size of 5%), and k p ranging from 1 to 30 (with a step size of 1 and counting each dummy variable separately although we use group constraints). For each combination of k n and k p values, we let the MIP algorithm run for 5,000 seconds or stop at a 1% optimality gap.
To measure the overall goodness of fit, we computed the robust information criterion discussed in Insolia et al. (2021) with an additional correction factor based on the truncated normal as in Riani et al. (2022). Figure S16 shows the robust Akaike information criterion (rAIC) path across different trimming levels, where we report k n /n on the x-axis. For each trimming level, we estimate the best model of size 1 ≤ k p ≤ 30, and then select the one with minimum rAIC. Here, we see that lower levels of trimming favor sparser solutions, which is likely due to the effects of outlying cases, and that there is little difference in terms of minimum rAIC for trimming levels equal to 5% or 10%. For this reason, instead of picking the solution associated with the overall minimum rAIC, we also compared predictive performance to guide our choice for the trimming level. Specifically, we randomly split the data and used 80% of the observations as a training set and the remaining 20% as a testing set. For each combination of k n and k p values, this procedure was repeated 8 independent times, and we computed the trimmed root mean square prediction error (TRMSPE) on testing data across models selected by rAIC (with upper trimming equal to the trimming level in use). The left panel of Figure S17 compares the medians plus/minus median absolute deviations (MAD) of the TRMSPE across the 8 random splits for MIP against the ones obtained by the sparseLTS (Alfons et al., 2013) -the latter is an heuristic algorithm for robust feature selection and is computed through the sparseLTS() function of the robustHD R package (Alfons, 2021). The TRMSPE is typically non-increasing, and our MIP procedure outperforms sparseLTS as the trimming level reaches 10%. Moreover, the TRMSPE for MIP has a much smaller variability (which is measured by the MAD) as the contamination level reaches 10%, indicating that this solution is more stable than others. We also took the sparsity of the MIP and sparseLTS solutions into account when comparing predictive performance. The right panel of Figure S17 compares medians and MADs of the selected models of size k p , for each trimming level, across the 8 random splits. For a 10% trimming, MIP not only performs better than sparseLTS in terms of predictive power, but also provides sparser and thus more interpretable solutions (lower median and narrower MAD). Here the median sparsity level for MIP is close to the one that we have found on the full data set (see Figure S16), and it is much more stable than the solutions with a 5% trimming -whose MAD is approximately twice as large.
In light of all these findings, we chose to present the results for a 10% trimming, and we verified that different trimming levels (say, from 5 up to 15%) provide results which are consistent with the ones discussed in the main text. Furthermore, Figure S18 contains several regression diagnostics supporting the fact that the model selected by rAIC with a 10% trimming satisfies the underlying assumptions (for the set of non-outlying cases), and highlights the presence of outlying cases which deserve further investigation (e.g., the presence of some residuals which are 6 standard deviations away from the robust fit). Figure S20 and Table S6 provide further analyses on the outlying cases detected by our MIP.
Computations for this research were performed on the Roar supercomputer of the Institute for Computational and Data Sciences at the Pennsylvania State University. The content of the research is solely the responsibility of the authors and does not necessarily represent the views of the Institute for Computational and Data Sciences. We used basic memory option on the ACI-B cluster with an Intel Xeon 24 core processor at 2.2 GHz and 128 GB of RAM. The multi-thread option in R and Gurobi was limited to a maximum of 24 threads.  (Karl and Koss, 1984  showing that the first quarter typically accounts for a sensibly higher proportion of losses and has a larger variability. The second quarter of each year (which is missing for 2019) generally accounts for lowest levels of losses and reports a lower variability compared to other quarters. Losses tend to increase again during the third and fourth quarters. Only in 2015 median losses across the third quarter are higher than the ones during the fourth quarter, but they have larger variability. to distinguish a pattern, where states belonging to the same climatic region tend to behave quite similarly. The West North Central and Northwest regions report sensibly lower losses (whose medians are smaller than 10%) characterized by a much smaller variability. On the other hand, many states in the Central region, as well as New Mexico (which has the highest median and variability), Arkansas, Kansas, Pennsylvania, Massachusetts, Ohio and Illinois, report a median loss which is higher than 20%.  Fig. S3, although colony loss is generally lower and the differences are less marked. Alabama reports high extreme losses, whose highest level is achieved during 2020 and is also associated to the largest number of added and renovated colonies (results not shown).  Fig. S3 hold also here, although less markedly. West North Central and Northwest regions report higher losses compared the first quarter (whose medians are typically higher than 10%), and most states in Central region report their lowest levels of losses among all quarters.  Fig. S3, although also here honey bee loss is generally lower for most states and the differences are less marked. Kansas reports high levels of losses, which are sensibly higher than other states in the South region and somehow more stable compared to the ones of New Mexico. The latter has a long left tail and reports median losses which are comparable to most states in the Southwest region.          (Alfons et al., 2013) and our mixed-integer programming approach (Insolia et al., 2021). These are computed across trimming proportions kn/n ranging from 0% to 15% (with a step size of 5%) and they are based on 8 independent training/testing splits containing 80% and 20% of the points, respectively. These results are based on data for the years 2015-2019.  (Insolia et al., 2021) with a 10% trimming. (a) Scaled residuals for outlying (red) and non-outlying (blue) cases. (b) Square root of absolute scaled residuals for outlying (red) and non-outlying (blue) cases. (c) Residuals for outlying (red) and non-outlying (blue) cases against fitted values (or predicted values for outlying cases). (d) Fitted values for non-outlying cases (blue) and predicted values for outlying cases (red) against the observed value of the response variable. (e) Scaled residuals against a robust measure of outlying-ness in the predictors' space. The latter is computed using the minimum covariance determinant estimator (Rousseeuw and Driessen, 1999) in the rrcov R package (Todorov and Filzmoser, 2009) where the trimming proportion is set to 10% as for MIP estimates and we used the 0.975 quantile of a chi-square distribution to flag leverage points (i.e., outliers in the predictors' space). Points are grouped as non-outlying cases and non-leverage points (NO-NL, green), outliers but non-leverage points (O-NL, purple), leverage points and non-outlying cases (NO-L, red), and outliers and leverage points (O-L, blue). (f ) Normal QQ-plot for the scaled residuals of non-outlying cases. These results are based on data for the years 2015-2019.   (Breiman, 2001), which is computed through the ranger R package , for the same set of variables used in our linear modeling exercise. We compared an increasing number of trees (from 1000 to 5000 with a step size of 1000) and number of variables to possibly split at in each node (ranging from 5 to 15). Feature importance is based on a permutation approach . This result is based on data covering the years 2015-2019.   The original land-use categories were grouped in 6 major classes: "developed", "forest", "pasture", "rangeland", "crop", and "water". Following the approach in Naug (2009), the "water" class was excluded from our analysis.  Year 2016 binary Tables S1, S2, S3 3 Year 2017 binary Tables S1, S2, S3 4 Year 2018 binary Tables S1, S2, S3 5 Year 2019 binary (reference category) Tables S1, S2, S3 6 Region Central binary (reference category) Figure S1 7 Region East North Central binary Figure S1  8 Region Northeast binary Figure S1  9 Region Northwest binary Figure S1 10 Region South binary Figure S1 11 Region Southeast binary Figure S1 12 Region Southwest binary Figure S1 13 Region West binary Figure S1 14 Region West North Central binary Figure S1 15 Quarter 1 binary Tables S1, S2  16 Quarter 2  binary  Tables S1, S2  17 Quarter 3  binary  Tables S1, S2  18 Quarter 4 binary (reference category) Tables S1, S2  19 Varroa destructor  percentage  Table S1  20 Other pests and parasites  percentage  Table S1  21 Diseases  percentage  Table S1  22 Pesticides  percentage  Table S1  23 Other  percentage  Table S1  continuous Table S2  36 Precipitation norm  continuous  Table S2  37 Precipitation entropy  continuous  Table S2  38 Precipitation skewness  continuous  Table S2  39 Green-area index  continuous  Table S3  Table S5: F -tests (Hastie and Pregibon, 1992) on the contribution of weather indexes for the years 2015-2019. Comparisons across nested fits for the Gaussianboth with and without the 10% outliers detected by our mixed-integer programming approach (Insolia et al., 2021) -and quasi-Poisson models: sample size (n), number of features (p), residual degree of freedom and residual deviance (residual sum of squares for the Gaussian models), difference in the degree of freedom between nested models, deviance (sum of squares for the Gaussian case), F -statistic, and p-value. For minimum and maximum temperatures, as well as precipitation, the associated reduced models exclude the following weather indexes that we built: "norm", "entropy", "skewness", "kurtosis", and "alpha index".  Table S6: Spatio-temporal information for outlying cases with unexpectedly higher or lower losses (i.e., large positive or negative residuals, respectively) detected by our mixed-integer programming approach (Insolia et al., 2021) with a 10% trimming for the years 2015-2019. We consider scaled residuals exceeding a given quantile of the standard normal distribution (first column), and for these points we report the number of observations belonging to each climatic region, state, year and quarter. Table S7: Regression results for the colony loss response against the sole predictor "other pests and parasites" (Model 1), against the sole predictor V. destructor (Model 2), against both "other pests and parasites" and V. destructor (Model 3), and finally against these two predictors and "other" (Model 4) for the years 2015-2019. For each predictor in each fit, we report the corresponding coefficient estimate, standard error, t-statistic and p-value. Outlying cases detected by our mixed-integer programming approach (Insolia et al., 2021) based on a 10% trimming were excluded from the analysis. While the marginal regressions in Models 1-2 have positive coefficients, the coefficient for "other pests and parasites" changes sign in Model 3, and becomes negative and significant after the inclusion of the variable "other" in Model 4.  Table S8: Comparison of estimated regression coefficients' signs across different estimation methods and models (negative/positive signs are reported as red/green cells) for the years 2015-2019. For our linear modeling exercise, as the one used by our mixed-integer programming (MIP) approach (Insolia et al., 2021) in the main text, we consider: OLS : ordinary least squares; glmnet: elastic-net penalty with mixing parameter α = 0.8 (Simon et al., 2011); SCAD: smoothly-clipped absolute deviations penalty (Breheny and Huang, 2011); sparseLTS : lasso penalty with least trimmed of squares loss based on a 10% trimming (Alfons, 2021). For the same model using count data as a response variable (i.e., the number of lost colonies per state and quarter) and an additional offset to capture different scales (i.e., the logarithm of maximum number of colonies per state and quarter), we compare: glmnet-Poisson: elastic-net penalty for Poisson models with mixing parameter α = 0.8 (Simon et al., 2011); snet-NB : SCAD penalty with a ridge-like parameter (also here we set α = 0.8) for negative binomial models (Breheny and Huang, 2011). Overall, these results are fairly consistent across different methods, and with the ones discussed in the main text for MIP. However, MIP provides a sparser and more interpretable solution.   Table S8 for the years 2015-2019. Here each fit is computed on the set of nonoutlying cases detected by our mixed-integer programming (MIP) approach (Insolia et al., 2021) with a 10% trimming. In this setting, results are more consistent across methods and models, and they resemble more closely the ones discusses in the main text based on MIP.   lia et al., 2021) with a 10% trimming, for the years 2015-2019, using state-level controls as opposed to climatic regions. The results are consistent with the ones discussed in the main text, with the difference that the "green-area index" switches sign and results as non-significant, as it partially accounts for state-level variability (see Figure S15). Model with R 2 = 0.65.  , and after the removal of outliers detected by MIP and missing data the sample size reduces to n = 437. For each predictor, we report the corresponding coefficient estimate, standard error, t-statistic and p-value. The model has an R 2 = 0.58. The results are consistent with the ones discussed in the main text for MIP, and only a few lagged terms report a significant coefficient; namely: "lagged V. destructor ", "lagged skewness of minimum temperatures", and "lagged kurtosis of minimum temperatures".  Table S12: Ordinary least squares fit for the features selected by glmnet (Simon et al., 2011) -with mixing parameter α = 0.8 and enforcing the inclusion of features selected by our mixed-integer programming (MIP) approach (Insolia et al., 2021)for a model as in Table 1 of the main text, covering the years 2015-2019, with 55 additional pairwise interaction terms computed among continuous features only (p = 82). Interaction terms with a marginal correlation larger than 0.7 were removed at the outset (as described in the Data treatment Section of the Supplementary Information), reducing the number of predictors to p = 48. Outlying cases detected by our MIP based on a 10% trimming were excluded from the analysis (n = 607). The model has an R 2 = 0.62. Most predictors selected by MIP remain significant, such as "V. destructor ", "pesticides", "kurtosis of maximum temperatures", "entropy of precipitations", "green-area index", etc. The interaction terms with a significant coefficient include the interaction of the "green-area index" with "other", "standard deviation of minimum temperatures", "entropy of precipitations", and "skewness of minimum temperature", as well as the interaction of the "alpha index of minimum temperatures" with "V. destructor " and "standard deviation of minimum temperatures". Overall, these results are consistent with the ones based on MIP which are discussed in the main text.  Table S13: Features selected using the mixed-integer programming procedure described in Insolia et al. (2021) for the years 2015-2021, with corresponding coefficient estimates, standard errors, t-statistics and p-values computed on a subset encompassing 90% of the observations (745 out of 828 are selected as "non-outlying", concurrently with the feature selection). Group-constraints are used to ensure that categorical controls for quarter and climatic region, e.g., the three terms representing quarters, are either all selected or all excluded. Moreover, we imposed the control for years to be retained in the model -as this would at least partially mitigate the effects of any anomalies in the last two years on the overall fit (here the reference category is the year 2021). Overall, this extended analysis confirms the main findings from